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Abstract 

In this paper, we propose a computationally efficient approach — space (Sparse 
PArtial Correlation Estimation) — for selecting non-zero partial correlations 
under the high-dimension-low-sample-size setting. This method assumes the 
overall sparsity of the partial correlation matrix and employs sparse regression 
techniques for model fitting. We illustrate the performance of space by exten- 
sive simulation studies. It is shown that space performs well in both non-zero 
partial correlation selection and the identification of hub variables, and also out- 
performs two existing methods. We then apply space to a microarray breast 
cancer data set and identify a set of hub genes which may provide important 
insights on genetic regulatory networks. Finally, we prove that, under a set of 
suitable assumptions, the proposed procedure is asymptotically consistent in 
terms of model selection and parameter estimation. 

key words: concentration network, high-dimension-low-sample-size, lasso, 
shooting, genetic regulatory network 
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1 INTRODUCTION 



There has been a large amount of hterature on covariance selection: the identifica- 
tion and estimation of non-zero entries in the inverse covariance matrix fa.k.a. concen- 



tration matrix or precision matrix) starting from the seminal paper by Dempster (1972) 



Covariance selection is very useful in elucidating associations among a set of ran- 
dom variables, as it is well known that non-zero entries of the concentration ma- 
trix correspond to non-zero partial correlations. Moreover, under Gaussianity, non- 
zero entries of the concentration matrix imply conditional dependency between cor- 
responding variable pairs conditional on the rest of the variables (lEdward 2000p . 
Traditional methods does not work unless the sample size (n) is larger than the 
number of variables (p) (IWhittaker 1990} lEdward 20000 . Recently, a number of 
methods have been introduced to perform covariance selection for data sets with 



p > n, for example, see Meinshausen and Buhlmann (2006) , Yuan and Lin (2007) 



Li and Gui (2006) , Schafer and Strimmer (2007) 



In this paper, we propose a novel approach using sparse regression techniques for 
covariance selection. Our work is partly motivated by the construction of genetic 
regulatory networks ( GRN) based on high dimensional gene expression data. Denote 
the expression levels of p genes as ■ ■ ■ , A concentration network is defined as an 
undirected graph, in which the p vertices represent the p genes and an edge connects 
gene i and gene j if and only if the partial correlation p*-' between Ui and Uj is non- 
zero. Note that, under the assumption that yi, - ■ ■ ,yp are jointly normal, the partial 
correlation p^^ equals to Corr(yj, ?/j|?/_(jj)), where y~{ij) = {yu ■ ^ ^ k hj ^ p}- 
Therefore, p^^ being nonzero is equivalent to y^ and yj being conditionally dependent 
given all other variables y-(i,j)- The proposed method is specifically designed for 
the high-dimension-low-sample-size scenario. It relies on the assumption that the 
partial correlation matrix is sparse (under normality assumption, this means that 



3 



most variable pairs are conditionally independent), which is reasonable for many 
real life problems. For instance, it has been shown that most genetic networks are 
intrinsically sparse (IGardner et al. 20031 Jeong et al. 2001 Tegner et al. 2003 ). The 
proposed method is also particularly powerful in the identification of hubs: vertices 
(variables) that are connected to (have nonzero partial correlations with) many other 
vertices (variables). The existence of hubs is a well known phenomenon for many large 
networks, such as the internet, citation networks, and protein interaction networks 
( INewman 2003p . In particular, it is widely believed that genetic pathways consist 
of many genes with few interactions and a few hub genes with many interactions 
( IBarabasi and Oltvai 2004^ . 

Another contribution of this paper is to propose a novel algorithm active-shooting 
for solving penalized optimization problems such as lasso (ITibshirani 1996|l . This 
algorithm is computationally more efficient than the original shooting algorithm. 



which was first proposed by Fu (1998) and then extended by many others including 



Genkin et al. (2007) and Friedman et al. (2007a) It enables us to implement the 



proposed procedure efficiently, such that we can conduct extensive simulation stud- 
ies involving ~ 1000 variables and hundreds of samples. To our knowledge, this is 
the first set of intensive simulation studies for covariance selection with such high 
dimensions. 

A few methods have also been proposed recently to perform covariance selection 
in the context oi p ^ n. Similar to the method proposed in this paper, they all 



assume sparsity of the partial correlation matrix. Meinshausen and Buhlmann (2006) 



introduced a variable-by-variable approach for neighborhood selection via the lasso 
regression. They proved that neighborhoods can be consistently selected under a set 
of suitable assumptions. However, as regression models are fitted for each variable 
separately, this method has two major limitations. First, it does not take into account 
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the intrinsic symmetry of the problem (i.e., p^^ = p-'*). This could result in loss of 
efficiency, as well as contradictory neighborhoods. Secondly, if the same penalty 
parameter is used for all p lasso regressions as suggested by their paper, more or less 
equal effort is placed on building each neighborhood. This apparently is not the most 
efficient way to address the problem, unless the degree distribution of the network is 
nearly uniform. However, most real life networks have skewed degree distributions. 



such as the power-law networks. As observed by Schafer and Strimmer (2007) , the 
neighborhood selection approach limits the number of edges connecting to each node. 
Therefore, it is not very effective in hub detection. On the contrary, the proposed 
method is based on a joint sparse regression model, which simultaneously performs 
neighborhood selection for all variables. It also preserves the symmetry of the problem 
and thus utilizes data more efficiently. We show by intensive simulation studies that 
our method performs better in both model selection and hub identification. Moreover, 
as a joint model is used, it is easier to incorporate prior knowledge such as network 
topology into the model. This is discussed in Section 12. 1[ 

Besides the regression approach mentioned above, another class of methods em- 



ploy the maximum likelihood framework. Yuan and Lin (2007) proposed a penal- 
ized maximum likelihood approach which performs model selection and estimation 
simultaneously and ensures the positive definiteness of the estimated concentration 
matrix. However, their algorithm can not handle high dimensional data. The largest 
dimension considered by them is p = 10 in simulation and p = 5 in real data. 



Friedman et al. (2007b) proposed an efficient algorithm glasso to implement this 



method, such that it can be applied to problems with high dimensions. We show 
by simulation studies that, the proposed method performs better than glasso in 



both model selection and hub identification. Rothman et al (2008) proposed an- 



other algorithm to implement the method of Yuan and Lin (2007) The compu- 
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tational cost is on the same order of glasso, but in general not as efficient as 



glasso. Li and Gui (2006) introduced a threshold gradient descent (TGD) regu- 



larization procedure. Schafer and Strimmer (2007) proposed a shrinkage covariance 



estimation procedure to overcome the ill-conditioned problem of sample covariance 
matrix when p > n. There are also a large class of methods covering the situation 
where variables have a natural ordering, e.g., longitudinal data, time series, spatial 



data, or spectroscopy. See Wu and Pourahmadi (2003) , Bickel and Levina (2008 



Huang et al. (2006) and Levina et al (2006) , which are all based on the modified 



Cholesky decomposition of the concentration matrix. In this paper, we, however, 
focus on the general case where an ordering of the variables is not available. 

The rest of the paper is organized as follows. In Section 2, we describe the joint 
sparse regression model, its implementation and the active-shooting algorithm. In 
Section 3, the performance of the proposed method is illustrated through simulation 
studies and compared with that of the neighborhood selection approach and the 
likelihood based approach glasso. In Section 4, the proposed method is applied 
to a microarray expression data set of n = 244 breast cancer tumor samples and 
p = 1217 genes. In Section 5, we study the asymptotic properties of this procedure. 
A summary of the main results are given in Section 6. Technique details are provided 
in the Supplemental Material. 

2 METHOD 

2.1 Model 

In this section, we describe a novel method for detecting pairs of variables having 
nonzero partial correlations among a large number of random variables based on 
i.i.d. samples. Suppose that, {yi,--- lUp)^ has a joint distribution with mean 
and covariance S, where S is a p by p positive definite matrix. Denote the partial 
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correlation between Ui and Uj by p*-' (1 < i < j < p). It is defined as Corr(ej, e^), where 
ej and are the prediction errors of the best linear predictors of Ui and yj based on 
y-{i,j) = {Vk '■ ^ ^ k ^ i, j < p}, respectively. Denote the concentration matrix 
by (cr*^)pxp- It is known that, p*^ = -^0^. Let V-i := {Vk ■ I < k i < p}. The 
following well-known result (Lemma [1]) relates the estimation of partial correlations 
to a regression problem. 

Lemma 1 .- For 1 < i < p, yi is expressed as yi = Yliij^iPijUj + ^'^^^ ^^^^ ^« ^■^ 



uncorrelated with y_i if and only if Pij = — ^ = p*"'y Moreover, for such defined 



Pij, Var(ej) — ^,Cov(ej,ej) — -^r^- 

Note that, under the normality assumption, p*-' = Corr(yj, and in Lemma 



the search for non-zero partial correlations can be viewed as a model selection problem 
under the regression setting. In this paper, we are mainly interested in the case 
where the dimension p is larger than the sample size n. This is a typical scenario 
for many real life problems. For example, high throughput genomic experiments 
usually result in data sets of thousands of genes for tens or at most hundreds of 
samples. However, many high-dimensional problems are intrinsically sparse. In the 
case of genetic regulatory networks, it is widely believed that most gene pairs are 
not directly interacting with each other. Sparsity suggests that even if the number 
of variables is much larger than the sample size, the effective dimensionality of the 
problem might still be within a tractable range. Therefore, we propose to employ 
sparse regression techniques by imposing the i\ penalty on a suitable loss function to 
tackle the high-dimension-low-sample-size problem. 

Suppose Y'^ = ■ ■ ■ , y^^ are i.i.d. observations from (0, E), for /c = 1, ■ ■ ■ , n. 
Denote the sample of the ^th variable as 'Yi = - ■ ■ , y"^^^ . Based on Lemma [H we 




[H we can replace "uncorrelated" with "independent". Since p^^ = sign{Pij)^y PijP. 
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propose the following joint loss function 

1 ^ 

L49,a,Y) = -(J2^u.\\Y,-Y,P^,^^\') 

i=l j^i 

= i(E-.l|Y.-EP«y?Y,|P), (1) 

1 = 1 



where 9 = {p^^,--- ,P^P-^^''f, a = {a''}^^^; Y = {Y''}^^^; and w = {tfij^i are 
nonnegative weights. For example, we can choose Wi = 1/Var(ej) = cr" to weigh 
individual regressions in the joint loss function according to their residual variances, 
as is done in regression with heteroscedastic noise. We propose to estimate the partial 
correlations 6 by minimizing a penalized loss function 

Cn{e, a, Y) = Ln{e, a, Y) + J{e), (2) 

where the penalty term J^{6) controls the overall sparsity of the final estimation of 6. 
In this paper, we focus on the ii penalty (ITibshirani 1996^ : 

:r(^) = A||^||i = A \P''\- (3) 

l<i<j<p 

The proposed joint method is referred to as space (Sparse PArtial Correlation Es- 



timation) hereafter. It is related to the neighborhood selection approach by Meinshausen and Buhlmann (200 
(referred to as MB hereafter) , where a lasso regression is performed separately for each 
variable on the rest of the variables. However, space has several important advan- 
tages. 

(i) In space, sparsity is utilized for the partial correlations 6* as a whole view. 
However, in the neighborhood selection approach, sparsity is imposed on each 
neighborhood. The former treatment is more natural and utilizes the data 
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more efficiently, especially for networks with hubs. A prominent example is the 
genetic regulatory network, where master regulators are believed to exist and 
are of great interest. 

(ii) According to Lemma [H /3ij and jSji have the same sign. The proposed method 
assures this sign consistency as it estimates {p*-'} directly. However, when fit- 
ting p separate (lasso) regressions, it is possible that sign{f3ij) is different from 
sign(/3jj), which may lead to contradictory neighborhoods. 

(iii) Furthermore, the utility of the symmetric nature of the problem allows us to 
reduce the number of unknown parameters in the model by almost half {p{p + 
l)/2 for space vs. {p — 1)^ for MB), and thus improves the efficiency. 

(iv) Finally, prior knowledge of the network structure are often available. The joint 
model is more flexible in incorporating such prior knowledge. For example, 
we may assign different weights Wi to different nodes according to their "im- 
portance". We have already discussed the residual variance weights, where 
Wi = a". We can also consider the weight that is proportional to the (esti- 
mated) degree of each variable, i.e., the estimated number of edges connecting 
with each node in the network. This would result in a preferential attachment 
effect which explains the cumulative advantage phenomena observed in many 
real life networks including GRNs (IBarabasi and Albert 19991) . 

These advantages help enhance the performance of space. As illustrated by the 
simulation study in Section [3l the proposed joint method performs better than the 
neighborhood selection approach in both non-zero partial correlation selection and 
hub detection. 

As compared to the penalized maximum likelihood approach glasso (IFriedman et al. 2007bp . 
the simulation study in Section [3] shows that space also outperforms glasso in both 
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edge detection and hub identification under all settings that we have considered. In 
addition, space has the following advantages. 

(i) The complexity of glasso is 0{p^), while as discussed in Section 2.2, the space 
algorithm has the complexity of min(0(?7,p^), 0{p^)), which is much faster than 



the algorithm of Yuan and Lin (2007) and in general should also be faster than 



glasso when n < p, which is the case in many real studies. 

(ii) As discussed in Section 6, space allows for trivial generalizations to other penal- 
ties of the form of |p*-' |'' rather than simply |p*"' |, which includes ridge and bridge 
(IFu 199811 or other more complicated penalties like SCAD (IFan and Li 200l"|l . 



The glasso algorithm, on the other hand, is tied to the lasso formulation and 
cannot be extended to other penalties in a natural manner. 

(iii) In Section 5, we prove that our method consistently identifies the correct net- 
work neighborhood when both n and p go to oo. As far as we are aware, no such 
theoretical results have been developed for the penalized maximum likelihood 
approach. 

Note that, in the penalized loss function ([2]), a needs to be specified. We propose 
to estimate 9 and a by a two-step iterative procedure. Given an initial estimate a^^^ of 
(T, 9 is estimated by minimizing the penalized loss function ([2]), whose implementation 
is discussed in Section I2l2l Then given the current estimates 6'*^'^^ and a^'^\ a is updated 



based on Lemmaffi 1/a" = },\\Y,-j:,^JifY,\\^ where P^f = We 
then iterate between these two steps until convergence. Since 1/cr** < Var(?/j) = an, 
we can use l/au as the initial estimate of a**, where an = Ylk=iiyi ~ Vi)^ 
sample variance of yi. Our simulation study shows that, it usually takes no more 
than three iterations for this procedure to stabilize. 
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2.2 Implementation 

In this section, we discuss the implementation of the space procedure: that is, mini- 
mizing ([2]) under the ii penalty Q. We first re-formulate the problem, such that the 
loss function ([T]) corresponds to the £2 loss of a "regression problem." We then use 
the active-shooting algorithm proposed in Section [2l3] to solve this lasso regression 
problem efficiently. 

Given a and positive weights w, let y = (Yf , Yj)-^ be a rap x 1 column vector, 
where Yj = y/Wi Yi {i = 1, ■ ■ ■ and let X = (A'(i2),--- ,^Y(p_i,p)) be a np by 
p{p — l)/2 matrix, with 

^(,,) = (0,...,0, ^YJ, 0,...,0, ygYf, 0,...,0r 

T T 

i^'^block /''block 

where a** = a^'^/wi {i = I,-- - ,p). Then it is easy to see that the loss func- 
tion ([1]) equals to j\\y — X6\W, and the corresponding £1 minimization problem 
is equivalent to: mine|||3^ — ^6*112 + Note that, the current dimension 

n = np and p = p{p — l)/2 are of a much higher order than the original n and 
p. This could cause serious computational problems. Fortunately, X is a. block 
matrix with many zero blocks. Thus, algorithms for lasso regressions can be ef- 
ficiently implemented by taking into consideration this structure (see the Supple- 
mental Material for the detailed implementation). To further decrease the compu- 
tational cost, we develop a new algorithm active-shooting (Section 12. 3p for the 
space model fitting. Active-shooting is a modification of the shooting algorithm, 



which was first proposed by Fu (1998) and then extended by many others includ- 



ing Genkin et al. (2007) and Friedman et al. (2007a) Active-shooting exploits the 



sparse nature of sparse penalization problems in a more efficient way, and is therefore 
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computationally much faster. This is crucial for applying space for large p and/or n. 
It can be shown that the computational cost of space is min(0(np^), 0(p^)), which 
is the same as applying p individual lasso regressions as in the neighborhood selection 
approach. We want to point out that, the proposed method can also be implemented 
by lars (lEfron et al. 2004|) . However, unless the exact whole solution path is needed, 



compared with shooting type algorithms, lars is computationally less appealing 
( IFriedman et al. 2007ap . (Remark by the authors: after this paper was submitted. 



recently the active-shooting idea was also proposed by Friedman et al. (2008) 

Finally, note that the concentration matrix should be positive definite. In prin- 
ciple, the proposed method (or more generally, the regression based methods) does 
not guarantee the positive definiteness of the resulting estimator, while the likelihood 



based method by Yuan and Lin (2007) and Friedman et al. (2007b) assures the posi- 
tive definiteness. While admitting that this is one limitation of the proposed method, 
we argue that, since we are more interested in model selection than parameter esti- 
mation in this paper, we are less concerned with this issue. Moreover, in Section [5l we 
show that the proposed estimator is consistent under a set of suitable assumptions. 
Therefore, it is asymptotically positive definite. Indeed, the space estimators are 
rarely non-positive-definite under the high dimensional sparse settings that we are 
interested in. More discussions on this issue can be found in Section [3l 

2.3 Active Shooting 

In this section, we propose a computationally very efficient algorithm active-shooting 
for solving lasso regression problems. Active-shooting is motivated by the shooting 
algorithm (IFu 1998p . which solves the lasso regression by updating each coordinate it- 



eratively until convergence. Shooting is computationally very competitive compared 
with the well known lars procedure (lEfron et al. 20041) . Suppose that we want to 
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minimize an ii penalized loss function with respect to (3 

/(/?) = ^I|Y-X/5||^ + 7^|/3,.|, 

j 

where Y = {yi, ■■■ , y^Y , X = {xij)n>,p = (Xi : ■ ■ ■ : Xp) and (3 = {pi, ■ ■ ■ , pp^. The 
shooting algorithm proceeds as follows: 

1. Initial step: for j = 1, ■ ■ ■ ,p, 



Pf = argmin^^.{l||Y-/3,X,||2 + 7|/3,.|} 
= sign(Y^X,)^^;#^, 



(4) 



where = a:/(^>o). 
2. For j = 1, update p^°^'^^ — > 



3(ne?«) _ „„„^- 1 



= argmin.J Y-E.^,/?r^X,-/5,X, +7|/5,| (5) 

. /(e(°"))^X, , ^(oid)\ / (e(°'''))rx, , n{old) 



XJX, 

where e^"''^) = Y-X^(°''^). 

3. Repeat step 2 until convergence. 

At each updating step of the shooting algorithm, we define the set of currently 
non-zero coefficients as the active set. Since under sparse models, the active set 
should remain small, we propose to first update the coefficients within the active 
set until convergence is achieved before moving on to update other coefficients. The 
active-shooting algorithm proceeds as follows: 

1. Initial step: same as the initial step of shooting. 
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2. Define the current active set A = {k : current (3k 7^ 0}. 

(2.1) For each /c G A, update f3k with all other coefficients fixed at the current 
value as in equation ([5]); 

(2.2) Repeat (2.1) until convergence is achieved on the active set. 

3. For j = 1 to p, update (3j with all other coefficients fixed at the current value 
as in equation ([5]). If no Pj changes during this process, return the current /? as 
the final estimate. Otherwise, go back to step 2. 



Table 1: The numbers of iterations required by the shooting algorithm and the 
active-shooting algorithm to achieve convergence {n = 100, A = 2). "coef. ^" is 
the number of non-zero coefficients 



p 


coef. # 


shooting 


active-shooting 


200 


14 


29600 


4216 


500 


25 


154000 


10570 


1000 


28 


291000 


17029 



The idea of active-shooting is to focus on the set of variables that is more likely 
to be in the model, and thus it improves the computational efficiency by achieving a 
faster convergence. We illustrate the improvement of the active-shooting over the 
shooting algorithm by a small simulation study of the lasso regression (generated in 



the same way as in Section 5.1 of Friedman et al. (2007a) ). The two algorithms result 
in exact same solutions. However, as can be seen from Table [H active- shooting 
takes much fewer iterations to converge (where one iteration is counted whenever an 
attempt to update a (3j is made). In particular, it takes less than 30 seconds (on 
average) to fit the space model by active-shooting (implemented in c code) for 
cases with 1000 variables, 200 samples and when the resulting model has around 
1000 non-zero partial correlations on a server with two Dual/Core, CPU 3 GHz and 
4 GB RAM. This great computational advantage enables us to conduct large scale 
simulation studies to examine the performance of the proposed method (Section [3]). 
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Remark 1 ; In the initial step, instead of using the univariate soft-shrinkage esti- 
mate, we can use a previous estimate as the initial estimate if such a thing is available. 
For example, when iterating between {p*-'} and {o""}, we can use the previous esti- 
mate of {p'-' } in the current iteration as the initial value. This can further improve 
the computational efficiency of the proposed method, as a better initial value implies a 
faster convergence. Moreover, in practice, often estimates are desired for a series of 
tuning parameters \, whether it is for data exploration or for the selection of X. When 
this is the case, a decreasing-lambda approach can be used to facilitate computation. 
That is, we start with the largest A (which results in the smallest model), then use the 
resulting estimate as the initial value when fitting the model under the second largest 
A and continue in this manner until all estimates are obtained. 



2.4 Tuning 



The choice of the tuning parameter A is of great importance. Since the space method 
uses a lasso criterion, methods that have been developed for selecting the tuning pa- 



rameter for lasso can also be applied to space, such as the GCV in Tibshirani (1996) 



the CV in Fan and Li (2001), the AIC in Buhlmann (2006) and the BIG in Zou et al. (2007) 



Several methods have also been proposed for selecting the tuning parameter in the set- 



ting of covariance estimation, for example, the MSE based criterion in Schafer and Strimmer (2007) 



the likelihood based method in Huang et al. (2006) and the cross-validation and boot 



strap methods in Li and Gui (2006) In this paper, we propose to use a "BIG-type" 



criterion for selecting the tuning parameter mainly due to its simplicity and compu- 
tational easiness. For a given A, denote the space estimator by 6x = {'pi : 1 < i < 
j < p} and ax = {o'x : 1 < < p}- The corresponding residual sum of squares for the 
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i-th regression: yi = Y^j^i PijVj + is 



n 

k=i \ j^i 



[a 



We then define a "BIC-type" criterion for tlie z-tli regression as 

BIC,{\) = nx \ogiRSS,iX)) + \ogn x #{j : j ^ ^ 0}. (6) 

Finally, we define BIC{\) := XliLi -^-^^«('^) ^^"^ select A by minimizing BIC{\). 
This method is referred to as space, joint hereafter. 



In Yuan and Lin (2007) , a BIG criterion is proposed for the penalized maximum 



likelihood approach. Namely 

BIC{\) := nx -log|S-i| +trace(%^S) +lognx#{(2, j) : 1 < i < j < p, 0}, 

(7) 

where S is the sample covariance matrix, and S^^ = (?^"') is the estimator un- 
der A. In this paper, we refer to this method as glasso.like. For the purpose 
of comparison, we also consider the selection of the tuning parameter for MB. Since 
MB essentially performs p individual lasso regressions, the tuning parameter can be 
selected for each of them separately. Specifically, we use criterion ^ (evaluated 
at the corresponding MB estimators) to select the tuning parameter Aj for the i- 
th regression. We denote this method as MB . sep. Alternatively, as suggested by 



Meinshausen and Buhlmann (2006) , when all Yi are standardized to have sample 
standard deviation one, the same \{a) = ^/n^^^{l — ^) is applied to all regres- 
sions. Here, $ is the standard normal c.d.f.; a is used to control the false discovery 
rate and is usually taken as 0.05 or 0.1. We denote this method as MB. alpha. These 
methods are examined by the simulation studies in the next section. 
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3 SIMULATION 

In this section, we conduct a series of simulation experiments to examine the 
performance of the proposed method space and compare it with the neighborhood 
selection approach MB as well as the penalized likelihood method glasso. For all 
three methods, variables are first standardized to have sample mean zero and sample 
standard deviation one before model fitting. For space, we consider three different 
types of weights: (1) uniform weights: Wi = 1; (2) residual variance based weights: 
Wi = a**; and (3) degree based weights: Wi is proportional to the estimated degree 
of Hi, i.e., #{j : p*-' 7^ 0,j 7^ i}. The corresponding methods are referred as space, 
space . sw and space . dew, respectively. For all three space methods, the initial value 
of cr" is set to be one. Iterations are used for these space methods as discussed in 
Section 12.11 For space . dew and space . sw, the initial weights are taken to be one 
(i.e., equal weights). In each subsequent iteration, new weights are calculated based 
on the estimated residual variances (for space . sw) or the estimated degrees (for 
space . dew) of the previous iteration. For all three space methods, three iterations 
(that is updating between {a**} and {p*-'}) are used since the procedure converges 
very fast and more iterations result in essentially the same estimator. For glasso, 
the diagonal of the concentration matrix is not penalized. 

We simulate networks consisting of disjointed modules. This is done because many 
real life large networks exhibit a modular structure comprised of many disjointed or 
loosely connected components of relatively small size. For example, experiments on 
model organisms like yeast or bacteria suggest that the transcriptional regulatory 
networks have modular structures (ILee et al. 2002[) . Each of our network modules is 



set to have 100 nodes and generated according to a given degree distribution, where 
the degree of a node is defined as the number of edges connecting to it. We mainly 
consider two different types of degree distributions and denote their corresponding 
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networks by Hub network and Power-law network (details are given later). Given 
an undirected network with p nodes, the initial "concentration matrix" {(T^^)pxp is 
generated by 



1, i = j; 

0, i ^ j and no edge between nodes i and j 

Umform{[—l, —0.5] U [0.5, 1]), i ^ j and an edge connecting nodes i and 



We then rescale the non-zero elements in the above matrix to assure positive definite- 
ness. Specifically, for each row, we first sum the absolute values of the off-diagonal 
entries, and then divide each off-diagonal entry by 1.5 fold of the sum. We then aver- 
age this re-scaled matrix with its transpose to ensure symmetry. Finally the diagonal 
entries are all set to be one. This process results in diagonal dominance. Denote the 
final matrix as A. The co variance matrix S is then determined by 

j) = A-1(z,j)/a/A-i(^,0A-iUj)- 

Finally, i.i.d. samples {Y'^}^^]^ are generated from Normal(0, S). Note that, i) = 
1, and S^i(i,i) = a'' > 1. 

Simulation Study I 

Hub networks In the first set of simulations, module networks are generated by 
inserting a few hub nodes into a very sparse graph. Specifically, each module consists 
of three hubs with degrees around 15, and the other 97 nodes with degrees at most 
four. This setting is designed to mimic the genetic regulatory networks, which usually 
contains a few hub genes plus many other genes with only a few edges. A network 



consisting of five such modules is shown in Figure 1(a) In this network, there are 



p = 500 nodes and 568 edges. The simulated non-zero partial correlations fall in 
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(-0.67,-0.1] U [0.1,0.67), with two modes around -0.28 and 0.28. Based on this 
network and the partial correlation matrix, we generate 50 independent data sets 
each consisting of n = 250 i.i.d. samples. 

We then evaluate each method at a series of different values of the tuning param- 



eter A. The number of total detected edges (Nt) decreases as A increases. Figure 2(a' 



shows the number of correctly detected edges (Nc) vs. the number of total detected 
edges (Nt) averaged across the 50 independent data sets for each method. We observe 
that all three space methods (space, space . sw and space . dew) consistently detect 
more correct edges than the neighborhood selection method MB (except for space . sw 
when Nt < 470) and the likelihood based method glasso. MB performs favorably 
over glasso when Nf is relatively small (say less than 530), but performs worse than 
glasso when Nt is large. Overall, space. dew is the best among all methods. Specif- 
ically, when Nf = 568 (which is the number of true edges), space. dew detects 501 
correct edges on average with a standard deviation 4.5 edges. The corresponding sen- 
sitivity and specificity are both 88%. Here, sensitivity is defined as the ratio of the 
number of correctly detected edges to the total number of true edges; and specificity 
is defined as the ratio of the number of correctly detected edges to the number of 
total detected edges. On the other hand, MB and glasso detect 472 and 480 correct 
edges on average, respectively, when the number of total detected edges Nt is 568. 

In terms of hub detection, for a given Nt, a rank is assigned to each variable i/i 
based on its estimated degree (the larger the estimated degree, the smaller the rank 
value). We then calculate the average rank of the 15 true hub nodes for each method. 



The results are shown in Figure 2(b) This average rank would achieve the minimum 
value 8 (indicated by the grey horizontal line), if the 15 true hubs have larger estimated 
degrees than all other non-hub nodes. As can be seen from the figure, the average rank 
curves (as a function of Nt) for the three space methods are very close to the optimal 
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minimum value 8 for a large range of A^^. This suggests that these methods can 
successfully identify most of the true hubs. Indeed, for space .dew, when A^^ equals to 
the number of true edges (568), the top 15 nodes with the highest estimated degrees 
contain at least 14 out of the 15 true hub nodes in all replicates. On the other hand, 
both MB and glasso identify far fewer hub nodes, as their corresponding average rank 
curves are much higher than the grey horizontal line. 

Table 2: Power (sensitivity) of space . dew , MB and glasso in identifying correct 
edges when FDR is controlled at 0.05. 



Network 


P 


n 


space. dew MB glasso 


Hub-network 


500 


250 


0.844 0.784 0.655 


Hub-network 


1000 


200 
300 
500 


0.707 0.656 0.559 
0.856 0.790 0.690 
0.963 0.894 0.826 


Power-law network 


500 


250 


0.704 0.667 0.580 



To investigate the impact of dimensionality p and sample size n, we perform 
simulation studies for a larger dimension with p = 1000 and various sample sizes 
with n = 200, 300 and 500. The simulated network includes ten disjointed modules 
of size 100 each and has 1163 edges in total. Non-zero partial correlations form 
a similar distribution as that of the p = 500 network discussed above. The ROC 
curves for space. dew, MB and glasso resulted from these simulations are shown in 
Figure [3l When false discovery rate (=l-specificity) is controlled at 0.05, the power 
(= sensitivity) for detecting correct edges is given in Table [2j From the figure and 
the table, we observe that the sample size has a big impact on the performance of all 
methods. For p = 1000, when the sample size increases from 200 to 300, the power of 
space . dew increases more than 20%; when the sample size is 500, space . dew achieves 
an impressive power of 96%. On the other hand, the dimensionality seems to have 
relatively less influence. When the total number of variables is doubled from 500 to 
1000, with only 20% more samples (that is p = 500, n = 250 vs. p = 1000, n = 300), 
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all three methods achieve similar powers. This is presumably because the larger 
network {p = 1000) is sparser than the smaller network {p = 500) and also the 
complexity of the modules remains unchanged. Finally, it is obvious from Figure [3] 
that, space. dew performs best among the three methods. 



Table 3: Edge detection under the selected tuning parameter A. For average rank, 
the optimal value is 15.5. For MB. alpha, a = 0.05 is used. 



Sample size 


Method 


Total edge detected 


Sensitivity 


Specificity 


Average rank 




space. joint 


1357 


0.821 


0.703 


28.6 


n = 200 


MB . sep 


1240 


0.751 


0.703 


57.5 




MB . alpha 


404 


0.347 


1.00 


175.8 




glasso . like 


1542 


0.821 


0.619 


35.4 




space. joint 


1481 


0.921 


0.724 


18.2 


re = 300 


MB . sep 


1456 


0.867 


0.692 


30.4 




MB . alpha 


562 


0.483 


1.00 


128.9 




glasso . like 


1743 


0.920 


0.614 
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space. joint 


1525 


0.980 


0.747 


16.0 


re = 500 


MB . sep 


1555 


0.940 


0.706 


16.9 




MB . alpha 


788 


0.678 


1.00 


52.1 




glasso . like 


1942 


0.978 


0.586 


16.5 



We then investigate the performance of these methods at the selected tuning pa- 
rameters (see Sect ion [23] for details). For the above Hub network with p = 1000 nodes 
and n = 200, 300, 500, the results are reported in Table [31 As can be seen from the 
table, BIG based approaches tend to select large models (compared to the true model 
which has 1163 edges), space .joint and MB . sep perform similarly in terms of speci- 
ficity, and glasso. like works considerably worse than the other two in this regard. 
On the other hand, space, joint and glasso. like performs similarly in terms of 
sensitivity, and are better than MB. sep on this aspect. In contrast, MB. alpha selects 
very small models and thus results in very high specificity, but very low sensitivity. 
In terms of hub identification, space, joint apparently performs better than other 
methods (indicated by a smaller average rank over 30 true hub nodes). Moreover, the 
performances of all methods improve with sample size. 
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Power-law networks Many real world networks have a power-law (also a.k.a scale- 
free) degree distribution with an estimated power parameter a = 2 ~ 3 (INewman 2003p . 
Thus, in the second set of simulations, the module networks are generated according 
to a power-law degree distribution with the power-law parameter a = 2.3, as this value 
is close to the estimated power parameters for biological networks (INewman 2003p . 



Figure 1(b) illustrates a network formed by five such modules with each having 100 
nodes. It can be seen that there are three obvious hub nodes in this network with 
degrees of at least 20. The simulated non-zero partial correlations fall in the range 
(-0.51, -0.08] U [0.08,0.51), with two modes around -0.22 and 0.22. Similar to the 
simulation done for Hub networks, we generate 50 independent data sets each con- 
sisting of n = 250 i.i.d. samples. We then compare the number of correctly detected 
edges by various methods. The result is shown in Figure |H On average, when the 
number of total detected edges equals to the number of true edges which is 495, 
space . dew detects 406 correct edges, while MB detects only 378 and glasso detects 
only 381 edges. In terms of hub detection, all methods can correctly identify the three 
hub nodes for this network. 

Summary These simulation results suggest that when the (concentration) networks 
are reasonably sparse, we should be able to characterize their structures with only 
a couple-of-hundreds of samples when there are a couple of thousands of nodes. In 
addition, space. dew outperforms MB by at least 6% on the power of edge detection 
under all simulation settings above when FDR is controlled at 0.05, and the improve- 
ments are even larger when FDR is controlled at a higher level say 0.1 (see Figure [3]) • 
Also, compared to glasso, the improvement of space .dew is at least 15% when FDR 
is controlled at 0.05, and the advantages become smaller when FDR is controlled at 
a higher level (see Figure [3]). Moreover, the space methods perform much better in 
hub identification than both MB and glasso. 
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Simulation Study II 



In the second simulation study, we apply space, MB and glasso on networks with 
nearly uniform degree distributions generated by following the simulation procedures 
in 



Meinshausen and Buhlmann (2006) as well as on the AR network discussed in 



Yuan and Lin (2007) and Friedman et al. (2007b) For these cases, space performs 



comparably, if not better than, the other two methods. However, for these networks 
without hubs, the advantages of space become smaller compared to the results on 
the networks with hubs. The results are summarized below. 

Uniform networks In this set of simulation, we generate similar networks as 



the ones used in Meinshausen and Buhlmann (2006) These networks have uniform 



degree distribution with degrees ranging from zero to four. Figure 5(a) illustrates 



a network formed by five such modules with each having 100 nodes. There are in 



total 447 edges. Figure 5(b) illustrates the performance of MB, space and glasso over 
50 independent data sets each having n = 250 i.i.d. samples. As can be been from 
this figure, all three methods perform similarly. When the total number of detected 
edges equals to the total number of true edges (447), space detects 372 true edges, 
MB detects 369 true edges and glasso 371 true edges. 

AR networks In this simulation, we consider the so called AR network used in 



Yuan and Lin (2007) and Friedman et al. (2007b) Specifically, we have a** = 1 for 
i = I, - ■ - p and cx*"^'* 



a 



0.25 for i = 2, ■ ■ ■ ,p. Figure 6(a) illustrates such a 



network with p = 500 nodes and thus 499 edges. Figure 6(b) illustrates the perfor- 



mance of MB, space and glasso over 50 independent data sets each having n = 250 
i.i.d. samples. As can be been from this figure, all three methods again perform 
similarly. When the total number of detected edges equals to the total number of 
true edges (499), space detects 416 true edges, MB detects 417 true edges and glasso 
411 true edges. As a slight modification of the AR network, we also consider a big 
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circle network with: o"** = 1 for i = 1, ■ ■ ■ p; i-* = u*-* i = 0.3 for i = 2, ■ ■ ■ ,p and 



0"^'^ = 0"^'^ = 0.3. Figure 7(a) illustrates such a network with p = 500 nodes and thus 



500 edges. Figure 7(b) compares the performance of the three methods. When the 
total number of detected edges equals to the total number of true edges (500), space, 
MB and glasso detect 478, 478 and 475 true edges, respectively. 

We also compare the mean squared error (MSE) of estimation of {(t"}. For the 
uniform network, the median (across all samples and A) of the square-root MSE is 
0.108,0.113,0.178 for MB, space and glasso. These numbers are 0.085,0.089,0.142 
for the AR network and 0.128,0.138,0.233 for the circle network. It seems that MB 
and space work considerably better than glasso on this aspect. 



Comments 

We conjecture that, under the sparse and high dimensional setting, the superior 
performance in model selection of the regression based method space over the pe- 
nalized likelihood method glasso is partly due to its simpler quadratic loss function. 
Moreover, since space ignores the correlation structure of the regression residuals, it 
amounts to a greater degree of regularization, which may render additional benefits 
under the sparse and high dimensional setting. 

In terms of parameter estimation, we compare the entropy loss of the three meth- 
ods. We find that, they perform similarly when the estimated models are of small 
or moderate size. When the estimated models are large, glasso generally performs 
better in this regard than the other two methods. Since the interest of this paper lies 
in model selection, detailed results of parameter estimation are not reported here. 

As discussed earlier, one limitation of space is its lack of assurance of positive 
definiteness. However, for simulations reported above, the corresponding estimators 
we have examined (over 3000 in total) are all positive definite. To further investigate 
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this issue, we design a few additional simulations. We first consider a case with 
a similar network structure as the Hub network, however having a nearly singular 
concentration matrix (the condition number is 16, 240; as a comparison, the condition 
number for the original Hub network is 62). For this case, the estimate of space 
remains positive definite until the number of total detected edges increases to 50, 000; 
while the estimate of MB remains positive definite until the number of total detected 
edges is more than 23, 000. Note that, the total number of true edges of this model 
is only 568, and the model selected by space, joint has 791 edges. In the second 
simulation, we consider a denser network (p = 500 and the number of true edges 
is 6,188) with a nearly singular concentration matrix (condition number is 3,669). 
Again, we observe that, the space estimate only becomes non-positive-definite when 
the estimated models are huge (the number of detected edges is more than 45,000). 
This suggests that, for the regime we are interested in in this paper (the sparse and 
high dimensional setting), non-positive-definiteness does not seem to be a big issue 
for the proposed method, as it only occurs when the resulting model is huge and thus 
very far away from the true model. As long as the estimated models are reasonably 
sparse, the corresponding estimators by space remain positive definite. We believe 
that this is partly due to the heavy shrinkage imposed on the off-diagonal entries in 
order to ensure sparsity. 

Finally, we investigate the performance of these methods when the observations 
come from a non-normal distribution. Particularly, we consider the multivariate ta/- 
distribution with df = 3, 6, 10. The performances of all three methods deteriorate 
compared to the normal case, however the overall picture in terms of relative perfor- 
mance among these methods remains essentially unchanged (Table HI). 
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Table 4: Sensitivity of different metliods under different trf/-distributions when FDR 
is controlled at 0.05 







Sensitivity 


df 


Method 


Hub 


Power-law 




space 


0.369 


0.286 


3 


MB 


0.388 


0.276 




glasso 


0.334 


0.188 




space 


0.551 


0.392 


6 


MB 


0.535 


0.390 




glasso 


0.471 


0.293 




space 


0.682 


0.512 


10 


MB 


0.639 


0.518 




glasso 


0.598 


0.345 



4 APPLICATION 



More than 500,000 women die annually of breast cancer world wide. Great efforts 
are being made to improve the prevention, diagnosis and treatment for breast cancer. 
Specifically, in the past couple of years, molecular diagnostics of breast cancer have 
been revolutionized by high throughput genomics technologies. A large number of 
gene expression signatures have been identified (or even validated) to have potential 
clinical usage. However, since breast cancer is a complex disease, the tumor process 
cannot be understood by only analyzing individual genes. There is a pressing need to 
study the interactions between genes, which may well lead to better understanding 
of the disease pathologies. 

In a recent breast cancer study, microarray expression experiments were conducted 



for 295 primary invasive breast carcinoma samples ( [Chang et al. 2005t|van de Vijver et al. 2002[ ). 
Raw array data and patient clinical outcomes for 244 of these samples are available on- 
line and are used in this paper. Data can be downloaded at http: //mi croarr ay-pubs . Stanford, edu/wounc 
. To globally characterize the association among thousands of mRNA expression lev- 
els in this group of patients, we apply the space method on this data set as follows. 
First, for each expression array, we perform the global normalization by centering the 
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mean to zero and scaling the median absolute deviation to one. Then we focus on a 
subset of p = 1217 genes/clones whose expression levels are significantly associated 
with tumor progression (p-values from univariate Cox models < 0.0008, correspond- 
ing FDR = 0.01). We estimate the partial correlation matrix of these 1217 genes with 
space . dew for a series of A values. The degree distribution of the inferred network is 
heavily skewed to the right. Specifically, when 629 edges are detected, 598 out of the 
1217 genes do not connect to any other genes, while five genes have degrees of at least 
10. The power-law parameter of this degree distribution is a = 2.56 , which is con- 
sistent with the findings in the literature for GRNs (INewman 20031) . The topology 



of the inferred network is shown in Figure 8(a), which supports the statement that 
genetic pathways consist of many genes with few interactions and a few hub genes 
with many interactions. 

We then search for potential hub genes by ranking nodes according to their de- 
grees. There are 11 candidate hub genes whose degrees consistently rank the highest 
under various A [see Figure 8(b)| . Among these 11 genes, five are important known 
regulators in breast cancer. For example, HNF3A (also known as FOXAl) is a 
transcription factor expressed predominantly in a subtype of breast cancer, which 
regulates the expression of the cell cycle inhibitor p27kipl and the cell adhesion 
molecule E-cadherin. This gene is essential for the expression of approximately 50% 
of estrogene-regulated genes and has the potential to serve as a therapeutic target 
fINakshatri and Badve 2"007l) . Except for HNF3A, all the other 10 hub genes fall in the 
same big network component related to cell cycle/proliferation. This is not surpris- 
ing as it is well-agreed that cell cycle/proliferation signature is prognostic for breast 
cancer. Specifically, KNSL6, STK12, RAD54L and BUBl have been previously re- 
ported to play a role in breast cancer: KNSL6 (also known as KIF2C) is important for 
anaphase chromosome segregation and centromere separation, which is overexpressed 
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in breast cancer cells but expressed undetectably in other human tissues except testis 
(IShimo et al. 2008]) : STK12 (also known as AURKB) regulates chromosomal segre- 
gation during mitosis as well as meiosis, whose LOH contributes to an increased breast 
cancer risk and may influence the therapy outcome fITchatchou et al. 2007^ : RAD54L 
is a recombinational repair protein associated with tumor suppressors BRCAl and 
BRCA2, whose mutation leads to defect in repair processes involving homologous re- 
combination and triggers the tumor development (IMatsuda et al. 1999P : in the end, 
BUBl is a spindle checkpoint gene and belongs to the BML-1 oncogene-driven path- 
way, whose activation contributes to the survival life cycle of cancer stem cells and 
promotes tumor progression. The roles of the other six hub genes in breast cancer are 
worth of further investigation. The functions of all hub genes are briefly summarized 
in Table O 



Table 5: Annotation of hub genes 



Index 


Gene Symbol 


Summary Function (GO) 


1 


CENPA 


Encodes a centromere protein (nucleosome assembly) 


2 




Annotation not available 


3 


KNSL6 


Anaphase chromosome segregation (cell proliferation) 


4 


STK12 


Regulation of chromosomal segregation (cell cycle) 


5 


NA. 


Annotation not available 


6 


URLC9 


Annotation not available (up-regulated in lung cancer) 


7 


HNF3A 


Transcriptional factor activity (epithelial cell differentiation) 


8 


TPX2 


Spindle formation (cell proliferation) 


9 


RAD54L 


Homologous recombination related DNA repair (meiosis) 


10 


ID-GAP 


Stimulate GTP hydrolysis (cell cycle) 


11 


BUBl 


Spindle checkpoint (cell cycle) 



5 ASYMPTOTICS 

In this section, we show that under appropriate conditions, the space procedure 
achieves both model selection consistency and estimation consistency. Use 9 and a to 
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denote the true parameters of 9 and a. As discussed in Section 12.11 when a is given, 
9 is estimated by solving the following £i penalization problem: 

9^-{a) = argminL„(^,a,Y) + A„||^||i, (9) 

9 

where the loss function L„(6', cr, Y) := ^^=1 L{9, a, Y'^), with, for = 1, ■ ■ ■ , n 

L(^,a,Y^) ■.= lj2''.{y^ -H^/^^'P^'y')"- (10) 

i=l 

Throughout this section, we assume Y^, ■ ■ ■ , Y" are i.i.d. samples from Np{0, S). 
The Gaussianity assumption here can be relaxed by assuming appropriate tail be- 
haviors of the observations. The assumption of zero mean is simply for exposition 
simplicity. In practice, in the loss function iQ, can be replaced by Y^ — Y where 
Y = ;^ ^fc=i is the sample mean. All results stated in this section still hold under 
that case. 

We first state regularity conditions that are needed for the proof. Define A = 
{(z,j):p^^VO}. 

CO: The weights satisfy < Wq < minjjuij} < maxj{u;j} < Woc < oo 

CI: There exist constants < A^i^{9) < Ajnax(^) < oo, such that the true covariance 
S = ^(9, a) satisfies: < A^in(^) < A„,in(S) < A^ax(S) < A^ax(^) < oo, 
where Amin and Amax denote the smallest and largest eigenvalues of a matrix, 
respectively. 

C2: There exist a constant 6 <1 such that for all ^ A 







-1 _ 






sign(6'^) 
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where for 1 < i < j < p,l < t < s < p, 



Condition CO says that the weights are bounded away from zero and infinity. Con- 
dition CI assumes that the eigenvalues of the true covariance matrix S are bounded 
away from zero and infinity. Condition C2 corresponds to the incoherence condition 
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Meinshausen and Buhlmann (2006) , which plays a crucial role in proving model 
selection consistency of £i penalization problems. 

Furthermore, since a is usually unknown, it needs to be estimated. Use a = dn = 
l^iijp^^ to denote one estimator. The following condition says 

D : For any ?7 > 0, there exists a constant C > 0, such that for sufficiently large 



n, maxi<j<p — < CUl"^^^) holds with probability at least 1 — 0{n~ 



Note that, the theorems below hold even when a is obtained based on the same 
data set from which 9 is estimated as long as condition D is satisfied. The following 
proposition says that, when p < n, we can get an estimator of cr satisfying condition 
D by simply using the residuals of the ordinary least square fitting. 

Proposition 1 Suppose Y = [Y^ Y"] is a p x n data matrix with i.i.d. 

columns Y* ~ Np{0, S). Further suppose that p = Pn such that p/n < 1 — 6 for some 
6 > 0; and S has a bounded condition number (that is assuming condition CI). Let 
a" denote the {i,i)-th element o/S^^; and let denote the residual from regressing 
Y' on to Y(_,) := [Y^ : ■ ■ ■ : Y^-^ : Y^+^ : ■ ■ ■ : Y"], that is 

e, = V - Yf_,)(Y(_,)Yf_.)-iY(„,)Y\ 
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Define a" = where 



n — p — 1 

t/ien condition D holds for {a^^}^^^. 

The proof of this proposition is omitted due to space hmitation. 

We now state notations used in the main results. Let g„ = |^| denote the num- 
ber of nonzero partial correlations (of the underlying true model) and let be 
a positive sequence of real numbers such that for any {i,j) G A: l/f-'l > s„. Note 
that, Sn can be viewed as the signal size. We follow the similar strategy as in 



Meinshausen and Buhlmann (2006) and Massam et al. (2007) in deriving the asymp- 



totic result: (i) First prove estimation consistency and sign consistency for the re- 
stricted penalization problem with 6j^c = (Theorem [T]). We employ the method 



of the proof of Theorem 1 in Fan and Peng (2004) ; (ii) Then we prove that with 



probability tending to one, no wrong edge is selected (Theorem [2]); (iii) The final 
consistency result then follows (Theorem [3l). 

Theorem 1 (consistency of the restricted problem) Suppose that conditions CO-Cl 



and D are satisfied. Suppose further thatqn ~ '^(y' j^^), A„y' — >■ oo and ^/q^Xn ~ 
o(l), as n oo. Then there exists a constant C{6) > 0, such that for any rj > 0, the 
following events hold with probability at least 1 — 0{n~'^): 

• there exists a solution Q-^'^" = 6'-^''^" (a) of the restricted problem: 



min L„,(^,a,Y) + A„||^||i, (11) 



where the loss function is defined via ( fiOj) . 

(estimation consistency) any solution 6*-^'-*'" of the restricted problem / fii]) satis- 
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fies: 

\\9-^'^--eA\\2<C(9)^Xn. 



• (sign consistency) if further assume that the signal sequence satisfies: -^^/=^ 

oo, — oo, then sign(6'J^''^") = sign(6'jj), for all 1 < i < j < p. 

Theorem 2 Suppose that conditions C0-C2 and D are satisfied. Suppose further that 



p = 0{n-) for some k > 0; g„ ~ o(^j^), ^ = o{\n), ^ oo and 

y^Xn ~ o(l), as n ^ oo. Then for any rj > 0, for n sufficiently large, the solution 
of (Ep satisfies 

P(^-) (^^max^ Y)| < > 1 - 0(n"''), 

where L'^^^j := gt- 

Theorem 3 Assume the same conditions of Theorem\^ Then there exists a constant 
C{6) > 0, such that for any rj > the following events hold with probability at least 
1 - 0(n-'').- 

• there exists a solution 6^^ = 6^" (a) of the ii penalization problem 

minL„(^,a,Y) + A„,||^||i, (12) 

6 

where the loss function L„ is defined via / flOj) . 

• (estimation consistency): any solution 6^" of [W) satisfies: 

\\e^--e\\2<c{e){^K). 
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• (Model selection consistency /sign consistency): 

sign(6'- ") = sign{9ij), for all 1 < i < j < p. 



Proofs of these theorems are given in the Supplemental Material. Finally, due 
to exponential small tails of the probabilistic bounds, model selection consistency 
can be easily extended when the network consists of disjointed components with 
= 0(n") for some a > 0, as long as the size and the number of true edges of each 
component satisfy the corresponding conditions in Theorem O 



Remark 2 The condition Any ^ oo is indeed implied by the condition y ^ = 
o(A„) as long as Qn does not go to zero. Moreover, under the "worst case" scenario, 



that is when g„ is almost in the order of A„ needs to he nearly in the order of 

^-1/4^ On the other hand, for the"hest case" scenario, that is when Qn = 0(1) (for 
example, when the dimension p is fixed), the order of A„ can be nearly as small as 
n'^l"^ (within a factor oflogn). Consequently, the i2-norm distance of the estimator 
from the true parameter is in the order of ^J\ogn/n, with probability tending to one. 



6 SUMMARY 

In this paper, we propose a joint sparse regression model - space - for select- 
ing non-zero partial correlations under the high-dimension-low-sample-size setting. 
By controlling the overall sparsity of the partial correlation matrix, space is able to 
automatically adjust for different neighborhood sizes and thus to utilize data more 
effectively. The proposed method also explicitly employs the symmetry among the 
partial correlations, which also helps to improve efficiency. Moreover, this joint model 
makes it easy to incorporate prior knowledge about network structure. We develop a 
fast algorithm active-shooting to implement the proposed procedure, which can be 
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readily extended to solve some other penalized optimization problems. We also pro- 
pose a "BIC-type" criterion for the selection of the tuning parameter. With extensive 
simulation studies, we demonstrate that this method achieves good power in non-zero 
partial correlation selection as well as hub identification, and also performs favorably 
compared to two existing methods. The impact of the sample size and dimensionality 
has been examined on simulation examples as well. We then apply this method on a 
microarray data set of 1217 genes from 244 breast cancer tumor samples, and find 11 
candidate hubs, of which five are known breast cancer related regulators. In the end, 
we show consistency (in terms of model selection and estimation) of the proposed 
procedure under suitable regularity and sparsity conditions. 

The R package space - Sparse PArtial Correlation Estimation - is available on 
cran. 

ACKNOWLEDGEMENT 

A paper based on this report has been accepted for publication on Journal of the 
American Statistical Association (http://www.amstat.org/publications/JASA/). 
We are grateful to two anonymous reviewers and an associate editor for their valuable 
comments. 

Peng is partially supported by grant DMS-0806128 from the National Science 
Foundation and grant 1R01GM082802-01A1 from the National Institute of General 
Medical Sciences. Wang is partially supported by grant 1R01GM082802-01A1 from 
the National Institute of General Medical Sciences. Zhou and Zhu are partially sup- 
ported by grants DMS-0505432 and DMS-0705532 from the National Science Foun- 
dation. 



34 



References 



Barabasi, A. L., and Albert, R. (1999), "Emergence of Scaling in Random Net- 
works," Science, 286, 509-512. 

Barabasi, A. L., and Oltvai, Z. N. (2004), "Network Biology: Understanding the 
Cells Functional Organization," Nature Reviews Genetics, 5, 101-113. 

Bickel, P.J., and Levina, E. (2008), "Regularized Estimation of Large Covariance 
Matrices," Annals of Statistics, 36, 199-227. 

Buhlmann, P. (2006), "Boosting for High-dimensional Linear Models," Annals of 
Statistics, 34, 559-583. 

Chang, H. Y., Nuyten, D. S. A., Sneddon, J. B., Hastie, T., Tibshirani, R., Sorlie, 
T., Dai, H., He,Y., van't Veer, L., Bartelink, H., and et al. (2005), "Robustness, 
Scalability, and Integration of a Wound Response Gene Expression Signature 
in Predicting Survival of Human Breast Cancer Patients," Proceedings of the 
National Academy of Sciences, 8;102(10): 3738-43. 

Dempster, A. (1972), "Covariance Selection," Biometrics, 28, 157-175. 

Edward, D. (2000), Introduction to Graphical Modelling (2nd ed.). New York: 
Springer. 

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), "Least Angle Re- 
gression," Annals of Statistics, 32, 407-499. 

Fan, J., and Li, R. (2001), "Variable Selection via Nonconcave Penalized Likeli- 
hood and Its Oracle Properties," Journal of the American Statistical Associa- 
tion, 96, 1348-1360. 

Fan, J., and Peng, H. (2004), "Nonconcave Penalized Likelihood with a Diverging 
Number of Paramters," Annals of statistics, 32(3), 928-961. 



35 



Friedman, J., Hastie, T., Hofling, H., and Tibshirani, R. (2007a), "Pathwise Coot- 
dinaXe Optimization " Annals of Applied Statistics , 1(2), 302-332. 

Friedman, J., Hastie, T., and Tibshirani, R. (2007b), "Sparse In- 
verse Covariance Estimation with the Graphical Lasso," Biostatis- 
tics doi:10.1093/biostatistics/kxm045. 

Friedman, J., Hastie, T., and Tibshirani, R. (2008), "Regularization Paths 
for Generahzed Linear Models via Coordinate Descent," Technical report: 
http:/ /www-stat. Stanford, edvr^jhf /ftp / glmnet.pdf 

Fu, W. (1998), "Penalized Regressions: the Bridge vs the Lasso," Journal of Com- 
putational and Graphical Statistics, 7(3), 397-416. 

Gardner, T. S., Bernardo, D. di, Lorenz, D., and Collins, J. J. (2003), "Inferring 
Genetic Networks and Identifying Compound Mode of Action via Expression 
Profiling," Science, 301, 102-105. 

Genkin, A., Lewis, D. D, and Madigan, D. (2007), "Large-Scale Bayesian Logistic 
Regression for Text Categorization," Technometrics , 49, 291-304. 

Huang, J., Liu, N., Pourahmadi, M., and Liu, L. (2006), "Covariance Matrix Selec- 
tion and Estimation via Penalised Normal Likelihood," Biometrika, 93, 85-98. 

Jeong, H., Mason, S. P., Barabasi, A. L., and Oltvai, Z. N. (2001), "Lethality and 
Centrality in Protein Networks," Nature, 411, 41-42. 

Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar- Joseph, Z., Gerber, G. K., 
Hannett, N. M., Harbison, C. T., Thompson, C. M., Simon, I., Zeitlinger, 
J., and et al. (2002), "Transcriptional Regulatory Networks in Saccharomyces 
Cerevisiae," Science, 298, 799-804. 

Levina, E. and Rothman, A. J. and Zhu, J. (2006), "Sparse Estimation of Large 
Covariance Matrices via a Nested Lasso Penalty," Biometrika, 90, 831-844. 

36 



Li, H., and Gui, J. (2006), "Gradient Directed Regularization for Sparse Gaussian 
Concentration Graphs, with Apphcations to Inference of Genetic Networks," 
BiostaUUcs, 7(2) ,302-317. 

Massam, H., Paul, D., and Rajaratnam, B. (2007), "Penahzed Empirical Risk 
Minimization Using a Convex Loss Function and ii Penalty," Unpublished 
Manuscript. 

Matsuda, M., Miyagawa, K., Takahashi, M., Fukuda, T., Kataoka, T., Asahara, 
T., Inui, H., Watatani, M., Yasutomi, M., Kamada, N., Dohi, K., and Kamiya, 
K. (1999), "Mutations in the Rad54 Recombination Gene in Primary Cancers," 
Oncogene, 18, 3427-3430. 

Meinshausen, N., and Buhlmann, P. (2006), "High Dimensional Graphs and Vari- 
able Selection with the Lasso," Annals of Statistics, 34, 1436-1462. 

Nakshatri, H., and Badve, S. (2007), "FOXAl as a Therapeutic Target for Breast 
Cancer," Expert Opinion on Therapeutic Targets, 11, 507-514. 

Newman, M. (2003), "The Structure and Function of Complex Networks," Society 
for Industrial and Applied Mathematics , 45(2), 167-256. 

Rothman, A. J. and Bickel, P. J. and Levina, E. and Zhu, J. (2008), "Sparse permu- 
tation invariant covariance estimation," Electronic Journal of Statistics , 2, 494- 
515. 

Schafer, J., and Strimmer, K. (2007), "A Shrinkage Approach to Large-Scale Co- 
variance Matrix Estimation and Implications for Functional Genomics," Statis- 
tical Applications in Genetics and Molecular Biology, 4(1), Article 32. 

Shimo, A., Tanikawa, C, Nishidate, T., Lin, M., Matsuda, K., Park, J., Ueki, T., 
Ohta, T., Hirata, K., Fukuda, M., Nakamura, Y., and Katagiri, T. (2008), "In- 
volvement of Kinesin Family Member 2C/Mitotic Centromere- Associated Ki- 

37 



nesin Overexpression in Mammary Carcinogenesis," Cancer Science, 99(1), 62- 
70. 

Tchatchou, S., Wirtenberger, M., Hemminki, K., Sutter, C, Meindl, A., Wap- 
penschmidt, B., Kiechle, M., Bugert, P., Schmutzler, R., Bartram, C, and 
Burwinkel, B. (2007), "Aurora Kinases A and B and Familial Breast Cancer 
Risk," Cancer Letters, 247(2), 266-272. 

Tegner, J., Yeung, M. K., Hasty, J., and Collins, J. J. (2003), "Reverse Engineering 
Gene Networks: Integrating Genetic Perturbations with Dynamical Modeling," 
Proceedings of the National Academy of Sciences USA, 100, 5944-5949. 

Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal 
of the Royal Statistical Society, Series B, 58, 267-288. 

van de Vijver, M. J., He, Y. D., van 't Veer, L. J., Dai, H., Hart, A. A.M., Voskuil, 
D. W., Schreiber, G. J., Peterse, J. L., Roberts, C, Marton, M. J., and et al. 
(2002), "A Gene- Expression Signature as a Predictor of Survival in Breast Can- 
cer," New England Journal of Medicine, 347, 1999-2009. 

Whittaker, J. (1990), Graphical Models in Applied Mathematical Multivariate 
Statistics, Wiley. 

Wu, W. B. and Pourahmadi, M. (2003), "Nonparametric estimation of large covari- 
ance matrices of longitudinal data," Annals of Applied Statistics, 2, 245-263. 

Yuan, M., and Lin, Y. (2007), "Model Selection and Estimation in the Gaussian 
Graphical Model," Biometrika, 94(1), 19-35. 

Zou, H., Hasite, T., and Tibshirani, R. (2007), "On the Degrees of Freedom of the 
Lasso," Annals of Statistics, 35, 2173-2192. 



38 



• 'V v.- • 



/ 



1// 



WSf" 



/ 



'/ 



(a) Hub network: 500 nodes and 568 edges. 15 nodes (in black) have degrees of 
around 15. 



/ lit- 



(b) Power-law network: 500 nodes and 495 edges. 3 nodes (in black) have degrees 
at least 20. 

Figure 1: Topology of simulated networks. 
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400 450 500 550 600 650 



Total detected edges 

(a) X-axis: the number of total detected edges(i.e., the total num- 
ber of pairs with p'^ ^ 0); y-axis: the number of correctly 
identified edges. The vertical grey line corresponds to the number 
of true edges. 
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(b) X-axis: the number of total detected edges; y-axis: the average 
rank of the estimated degrees of the 15 true hub nodes. 



Figure 2: Simulation results for Hub network. 
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1 -specificity 



Figure 3: Hub network: ROC curves for different samples sizes [p = 1000). 




Figure 4: Simulation results for Power-law network, x-axis: the number of total 
detected edges; y-axis: the number of correctly identified edges. The vertical grey 
line corresponds to the number of true edges. 



41 



MB: Total Edge= 420 
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(a) Uniform network: 500 nodes and 447 edges. 




(b) Simulation results for Uniform network, x-axis: the total number of edges de- 
tected; y-axis: the total number of correctly identified edges. The vertical grey line 
corresponds to the number of true edges. 

Figure 5: Simulation results for Uniform networks. 
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AR: Total Edge= 499 




(a) AR network: 500 nodes and 499 edges. 




(b) Simulation results for AR network, x-axis: the total number of edges detected; 
y-axis: the total number of correctly identified edges. The vertical grey line corre- 
sponds to the number of true edges. 

Figure 6: Simulation results for AR networks. 
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(a) Big-circle network: 500 nodes and 500 edges. 




(b) Simulation results for Circle network, x-axis: the total number of edges de- 
tected; y-axis: the total number of correctly identified edges. The vertical grey line 
corresponds to the number of true edges. 



Figure 7: Simulation results for Circle networks. 
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(a) Network inferred from the real data (only showing components 
with at least three nodes). The gene annotation of the hub nodes 
(numbered) are given in Table 5. 
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(b) Degree ranks (for the 100 genes with highest degrees). Dif- 
ferent circles represent different genes. Solid circles: the 11 
genes with highest degrees. Circles: the other genes. The 
sd(rank) of the top 11 genes are all smaller than 4.62 (4.62 is 
the 1% quantile of sd(rank) among all the 1217 genes), and 
thus are identified as hub nodes. 

Figure 8: Results for the breast cancer expression data set. 
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Supplemental Material 



Part I 



In this section, we list properties of the loss function: 



p 



p 



where Y = {yi, ■ ■ • , yp)'^ and yi = Va^yi,Wi = Wi/a'^\ These properties are used for 
the proof of the main results. Note: throughout the supplementary material, when 
evaluation is taken place at a = o", sometimes we omit the argument a in the notation 
for simplicity. Also we use Y = [yi, - ■ ■ , yp)^ to denote a generic sample and use Y 
to denote the p x n data matrix consisting of n i.i.d. such samples: Y-*^, ■ ■ ■ , Y", and 
define 



Al: for all 9, a and F G 7^^ 1(9, a, Y) > 0. 

A2: for any Y G TZ^ and any cr > 0, L{-,a,Y) is convex in 9; and with probability 
one, L{-,a,Y) is strictly convex. 

A3: for 1 < i < j < p 



1=1 




(S-2) 



k=l 
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A4: for 1 < i < j < p and 1 < k < I < p, 



d^L{9,a,Y)\ d 



dpijpki J dp' 



f dL{e,a,Y) 



and L [6, a) is positive semi-definite. 

If assuming CO-Cl, then we have 

BO : There exist constants < (Tq < (Too < oo such that: < (Tq < min{cT** : 1 < 
^ < p} < max{cr** : 1 < < p} < (Toq. 

Bl : There exist constants < A^^^{9) < A^^^{9) < oo, such that 

< A^^^M < \nUL"m < KUL"m < ALx(^1 < oo 
Bl.l : There exists a constant K{6) < oo, such that for all 1 < i < j < p, L'-j ^j{6) < 

B1.2 : There exist constants Mi{9), M2{9) < oo, such that for any I < i < j < p 
Var(,-^)(L^(^;a,F)) < M^ie), Var(,-^)(L:;. .^-(^^ a, F)) < AU9). 

B1.3 : There exists a constant < g{6) < oo, such that for all G A 



where Aij = A/{{i,j)}. 
B1.4 : There exists a constant M[6) < oo, such that for any (i, j) G A'^ 



'L:a^)[LAAm-%<M{e). 
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B2 There exists a constant Ki{6) < oo, such that for any 1 < i < j < p, 
\\Ee{yiyjyy^)\ \ < Ki{0), where y = {yi, ■ ■ ■ ,yp)^. 

B3 If we further assume that condition D holds for a and g„ ~ ^i'dgn)^ have: for 
any r] > 0, there exist constants Ci^r), C*2,?7 > 0, such that for sufficiently large n 



— — / lo£[ Tl 

max L:,,,(^,a,Y)-LV,(^,a,Y) <Ci,,(W^), 



max K,,,M<^,^)-K.,k,s{0,^,Y)\<C,^^^^^^ 

i<i<k<p,i<t<s<p ' ■ ' ' ■ \ n 

hold with probability at least 1 — 0{n~^). 

BO follows from CI immediately. B1.1-B1.4 are direct consequences of Bl. B2 
follows from Bl and Gaussianity. B3 follows from conditions CO-Cl and D. 

proof of Al: obvious. 
proof of A2: obvious. 

proof of A3: denote the residual for the ith term by 

Then evaluated at the true parameter values [6, a), we have ei{6, a) uncorrelated with 
and E(^g^^-^{ei{9, a)) = 0. It is easy to show 

= -Wiei{9, a)yj - Wjej{9, ajyi. 

This proves A3. 

proof of A4 '■ see the proof of Bl. 

proof of Bl : Denote y = (^i, ■ ■ ■ , ^p)^, and x = (x(i,2), X(i,3), ■ ■ ■ , X(p_i,p)) with 
~ (0; ■ ■ ■ ) 0) ■ ■ ■ ) ^i) 0, ■ ■ ■ , 0)^. Then the loss function (IS-ip can be written 
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as L(9,a,Y) = ^\\w{y — x9)\\l, with w = diag{\/^i, ■ ■ ■ Thus L (9, a) = 

E{e,a) [x'^w'^x'j (this proves A4). Let d = p{p — l)/2, then x is a. p hj d matrix. 
Denote its ith row by {1 < i < p). Then for any a e W^, with 1 1 a||2 = 1, we have 

\i=i 

Index the elements of a by a = (0(1,2)50(1^3), ■ ■ ■ , a(p_i,p))"^, and for each I < i < p, 
define G TU' by = (a(i^j),-- - , a(j_i,j), 0, a(j_j+i), ■ ■ ■ ,a(jp))'^. Then by definition 
xja = y^tti. Also note that Yl^i=i \ \'^i\W = 2||a||2 = 2. This is because, for i 7^ j, the 
jt/i entry of appears exactly twice in a. Therefore 

p p p 

a^L {9)a = "^WiEg [afyy'^ai) = ^ty^afSai > '^WiX,,,in{^)\\ai\\l > 2i2;oAmin(^), 

1=1 i=l i=l 

where E = Var(y) and wq = Wo/aoo- Similarly L {9)a < 2wooAmax(^), with 
Woo = Woo/cTo. By CI, S has bounded eigenvalues, thus Bl is proved. 



proof of Bl.l: obvious. 



proof of B1.2: note that Var(^ 5.)(ej(6', a)) = l/a" and Var(g^^)(?/j) = cr". Then for any 
1 ^ ^ < J ^ by Cauchy-Schwartz 



< E(^e^,){w^e^{9, a)y]) + E^s^,){w]e]{9, a)y}) 
(o-««)3 (a-Ji)3 a-'"a^'J' 
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The right hand side is bounded because of CO and BO. 



proof of B1.3: for (i, j) G A, denote 



-1 



Then D ^ is the entry in L_^^{6) . Thus by Bl, D ^ is positive and 

bounded from above, so D is bounded away from zero. 



2^' max 



-2\ 



12 ^ \\^ij,A 

is bounded from above, thus it suffices to show that IIL, 



By 



ij,A 
■II 



proof of B 14: note that \\^ij^A^Q)\Lj^jSQ)Y^\\l < 

Bl, Amax([-^A4( 

is bounded. Since G define y4+ := (2,j)U^- Then 

is the inverse of the (1, 1) entry of _4+(^). Thus by Bl, it is bounded away from 
zero. Therefore by Bl.l, L^'^-_4(^)[L^_4(^)]~^L^ is bounded from above. Since 

> \K,M\\l^^A^'AAm-'). andby Bl, Xr^A^'AAm-'' . 
is bounded away from zero, we have H-^^jj .4(^)112 bounded from above. 

proof of B2: the (fc, /)-th entry of the matrix yifjjyy'^ is yiyjfjkyi, for 1 < k < I < p. 
Thus, the {k, /)-th entry of the matrix E[yiyjyy'^] is E[yiyjykyi] = ^ij^ki+o^ik^ji+^u^jk- 



Thus, we can write 



(S-3) 



where di. is the p x 1 vector (5"jfc)^^j^. From (]S-3p . we have 



O'i- 2 CTf 2, 



(S-4) 
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where || ■ || is the operator norm. By CO-Cl, the first term on the right hand side is 
uniformly bounded. Now, we also have, 



o'ii - 0-i. S(„j)5-i. > 



(S-5) 



where is the submatrix of SI removing i-th row and column. From this, it follows 
that 



1/2 f.-l/2~ 



a. 



i- \\2 



< 
< 



.1/2 II II 
II II 



^(-i) ^i- Il2 



(S-6) 



where the last inequality follows from (18-50 . and the fact that 5](„j) is a principal 
submatrix of S. Thus the result follows by applying (1S-6P to bound the last term in 

(H). 

proof of B3 : 



n 

-E 

n ^ 

1=1 



-Wi 



-Wk Vk 



.kk 



E 




kkp''y\ 



:y\- 



kk^i' 



Thus, 



-Wi 



-Wk 



ViVk 



ViVk 





■^kk 




/ CT" 








V ^kk 



kk 



a 



kk 



a 



kk 
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where for I < i, j < p, yiy] := ^ X]r=i ylVj- ^ij denote the (i, j)-th element of the 
true covariance matrix S. By CI, {aij '■ 1 < j < p] are bounded from below and 
above, thus 



max \yiyj - aij\ = Op( 

l<i,j<p 



logn, 



n 



(Throughout the proof, Op(-) means that for any r/ > 0, for sufficiently large n, the left 
hand side is bounded by the order within Op{-) with probability at least 1 — 0(n~'').) 
Therefore 



Y] IVjVk-c^jkWp'^ < (V \p'^) max IViVj-aijl < ( /g„ ^(p'^T) max IViVj-aijl = o(l). 



where the last inequality is by Cauchy-Schwartz and the fact that, for fixed z, there 
are at most g„ non-zero p*-'. The last equality is due to the assumption g„ ~ ^(e^ti)) 
and the fact that 'Ylij^i{p^^Y is bounded which is in turn implied by condition CI. 
Therefore, 



\L'{la,Y)-L'{la,Y) 



< {wi\aik\ + Wk\aik\) max 

i.k 



kk 



kk 



[WiTki + WkTik) max 



+ Rni 



where r^j := 'Ylj^ki \'^jkP^^\i ^"^^ the reminder term i?„ is of smaller order of the leading 
terms. Since CI implies BO, thus together with condition D, we have 



max 

l<i,fc<p 



a" 

Zlk 



a" 



a 



kk 



Or. 



logn, 



n 



max 

l<i,j,k<p 



a 



kk 



a 



kk 



Or. 



I log n ^ 



n 
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Moreover, by Cauchy-Schwartz 



jk) 



and the right hand side is uniformly bounded (over [i, k)) due to condition CI. Thus 
by CO, CI and D, we have showed 



max \L'^ je, a, Y) - L'„ JO, a,Y)\ = O 



logn. 



n 



Observe that, forl<i<A;<p, l<t<s<p 



T" 

n,ik,ts 



if otherwise. 

Thus by similar arguments as in the above, it is easy to proof the claim. 

Part II 

In this section, we proof the main results (Theorems 1-3). We first give a few lemmas. 

Lemma S-1 (Karush-Kuhn- Tucker condition) 6 is a solution of the optimization 
problem 

arg min L„(6', a, Y) + A„| |6'| |i, 
where S is a subset of T := {{i,j) : 1 < < J < p}, if and only if 



A„sign(%), if 9ij ^ 



L'Ma,Y)\ < A., if %=0, 
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for G S. Moreover, if the solution is not unique, \L'j^ -{6,a,Y)\ < A„, for some 
specific solution 6 and L'^^j{6,a,Y) being continuous in 6 imply that 6ij = for 
all solutions 9. (Note that optimization problem (9) corresponds to S = T and the 
restricted optimization problem (11) corresponds to S = A.) 



Lemma S-2 For the loss function defined by liS-B) . if conditions CO- CI hold and 
condition D holds for a and if Qn ~ '^ii^^)' then for any rj > 0, there exist constants 
Co,r], Ci,rj, C2,r), Cs,^ > 0, such that for any u G i?^" the following hold with probability as 
least 1 — 0{n~^) for sufficiently large n: 



n 



I Tr/ (-n - vM ^ II II ^ (IrJogn 

\u L„,^(^,cr, Y)| < ci,^||m||2(Y — - — ) 



logn^ 



n 



l^n,A4(^>^>Y)M-L^(^)M||2 < C3,^||M||2(g« '^^^^^ 



in I 

n 



proof of Lemma \S-B If we replace a by a on the left hand side, then the above 
results follow easily from Cauchy-Schwartz and Bernstein's inequalities by using B1.2. 
Further observe that, 

\\L'^^^(e, a, Y) 1 12 < I \K,^M Y) 1 12 + WK^je, a, Y) - a, Y) \\,, 

and the second term on the right hand side has order since there are g„ 

terms and by B3, they are uniformly bounded by sj'^^- The rest of the lemma can 
be proved by similar arguments. 

The following two lemmas are used for proving Theorem 1. 
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Lemma S-3 Assuming the same conditions of Theorem 1. Then there exists a con- 
stant Ci{9) > 0, such that for any t] > 0, the probability that there exists a local 
minima of the restricted problem (11) within the disc: 



{e:\\e-d\\^<c,{d)^nK]. 

is at least 1 — 0{n~^) for sufficiently large n. 

proof of Lemma Let a„ = ^/q^Xn, and Qn{0,d,Y , \n) = L„(6',a, Y) + A„||6'||i. 
Then for any given constant C > and any vector u E RF such that u_4c = and 
||m||2 = C, by the triangle inequahty and Cauchy-Schwartz inequahty, we have 

ll^lli - ||^ + < Q;„||n||i < Can^/(h- 

Thus 

(5„(^ + a„u,a,Y, A„) - a, Y, A„) 

= {L„(^ + a„n,a,Y)-L„(^,a,Y)}-A„{||^||i-||^ + a„n||i} 
> + Y) - Y)} - Cany/q^K 

= {L^(e + anU,a,Y)-Lr,{e,a,Y)}-Cal 

Thus for any rj > 0, there exists Ci C2^n > 0, such that, with probabihty at least 
1 - 0{n-i) 

Ln(9 + anU, a, Y) - 1^(6, a, Y) = a„u^L^ a, Y) + ^alu^L'^^^^(9, a, Y)n^ 
= ^"nW^^AA + (^nU^L'^^j^{9, a, Y) + ^alu^ [K,AAi^^ Y) - ua 
> \<^lu\L"j^{d)uj^ - Ci,r,(a„gy^n^^/Vlog^) - C2,^(«^gnn"^/^ Vlogn). 
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In the above, the first equation is because the loss function L(6,a,Y) is quadratic 
in 6 and Uj^c = 0. The inequality is due to Lemma IS-21 and the union bound. By 
the assumption ^nJx^ oo, we have UnQn'^ n"^/"^ y/\og n = o(a;„ y^A„) = o(a^). 



Also by the assumption that g„ ~ o{^/ri/\ogn), we have a^qnU ^/^i/logn = o(a^). 
Thus, with n sufficiently large 



Qn{0 + a„u, a, Y, A„) - Qn{0, a, Y, A„) > ^a^u^L^_4(6')n^ - Ca^ 



with probability at least 1 - 0(ra-''). By Bl, u^L^^{e)uj( > ^^M\\ua\\1 = 
A^i^(^)C^. Thus, if we choose C = 4/ A^^^{9) + e, then for any r] > 0, for sufficiently 
large n, the following holds 



inf Qn{9 + anU, a, Y, A„) > Qn{0, a, Y, A„) 

u:u^c=0,\\u\\2=C 



with probability at least 1 — 0{n This means that a local minima exists within 
the disc {6 : ||^ — ^||2 < Can = C^yq^Xn} with probability at least 1 — O^n""^). 

Lemma S-4 Assuming the same conditions of Theorem 1. Then there exists a con- 
stant 02(6) > 0, such that for any t] > 0, for sufficiently large n, the following holds 
with probability at least 1 — 0{n^^'): for any 9 belongs to the set S = {9 : ||^ — ^||2 > 
C2{9)^/q^Xn,9j(c = 0}, it has ||L^_4(6', a, Y)||2 > y/q^K- 



proof of Lemma S-4 Let a„ = Jq^Xn- Any 9 belongs to S can be written as: 9 



9 + a„M, with M_4c = and ||m||2 > C*2(^). Note that 

L;,^(^,a,Y) = L'^^,,{9,a,Y) + anLlj^{9,d,Y)u 

= L'^^a(9, a, Y) + a, Y) - l'^(9))u + aX^mu. 

By the triangle inequality and Lemma IS-2t for any 77 > 0, there exists constants 
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Co,Ti, c^,r) > 0, such that 



||L'„_4(6',a, Y)||2 > an||L^^(6'))n||2-co,r,(gy^n \/ log n) -cs,^ 1 1 n 1 1 2 {anqnU ^^"^^/logn) 

with probabihty at least 1 — O^n""^). Thus, similar as in Lemma [8-31 for n sufficiently 
large, \\L'^^^{6,a,Y)\\2 > |L^(6'))u| I2 with probability at least 1 — 0{n^'^). By 
Bl, \\L^^(e))u\\2 > A^i,(^)||M||2. Therefore 02(6) can be taken as 2/A^-^^^{6) + e. 
The following lemma is used in proving Theorem 2. 

Lemma S-5 Assuming conditions CO-Cl. Let D_4^j((9,Y) = L'^ _^_^(9,Y) — L'^_^(9). 

Then there exists a constant K2{0) < 00, such that for any {k, I) G A, Amax(Varg(Z}_4 Y))) < 

K2{e). 

proof of Lemmam Var,(D^,H(^, 1^)) = ^^^(MUhI^, 

Thus it suffices to show that, there exists a constant K2{6) > 0, such that for all {k, I) 

Use the same notations as in the proof of Bl. Note that L'l_^j.i{6, Y) = x'^w'^X(k,i) = 
WkViXk + wiVkXi- Thus 

E-e{L'lA,kA^^ ^)^'i,Afc«(^' = '^I'^iy'i^kxl] + wf^lvlxixj] + WkWiE[ykyi{xkxf + xix^)], 
and for a G TZp^p-^^^^ 

= wlalE[yfyy'^]ak + wf af]K[ylyy'^]ai + 2wkWialE[ykyiyy'^]ai. 
Since ELi ll^fclli = 2||a||^, and by B2: Amax(E[yiyjyy^]) < A'i(^) for any 1 < « < 
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j < p, the conclusion follows. 

proof of Theorem 1: The existence of a solution of (11) follows from Lemma [S-31 By 
the Karush-Kuhn- Tucker condition (Lemma IS-ip . for any solution 6 of (11), it has 
||L'„_^(^,a,Y)|U < A„. Thus ||L:,_^(^, a, Y)||2 < ^||L;_^(^,a, Y)|U < v^A„. 
Thus by Lemma [S-41 for any t] > 0, for n sufficiently large with probability at least 
1 — 0{n^^), all solutions of (11) are inside the disc {9 : 116* — ^||2 < C2{0)y/q^Xn}. 

Since -^^^=^ ^ oo, for sufficiently large n and G A: 9ij > s„ > 2C2(6')y^A„. 

Thus 

i-o{n-^) < Pg{\\e-^''- -e^\\2 < C2(e)^Xn,e., > 2C2{e)y^Xn, for aii(z,j) e ^) 

< Pg (sign(^^'"") = sign(%), for all(z, j) G . 

proof of Theorem 2: For any given 77 > 0, let 77' = + k. Let Sn = {sign(6'-^'^") = 
sign(^)}. Then by Theorem 1, Pg{£n) > 1 — 0{n^^') for sufficiently large n. On En, 
by the Karush-Kuhn- Tucker condition and the expansion of L'^ ^{6-^'''^'\a ,Y) at 6 

-A„sign(^^) = L:,,_4(^-^'^",a,Y) = L;,^(^,a,Y) + L:,^^(^,a,Y)z/„ 

= L"_^{e)v^ + L;,^(^; a, Y) + a, Y) - L';^{e)) 

where Un := 6j^^" — Ojs,. By the above expression 

l^n = -\n\t'^AiP)]~^^^^<^A) - [i;;^(^)]-M^'n,^(^,^,Y) +D„,^(^,a,Y)z/„], (S-7) 

where Dri,A4(^) S", Y) = L'l^ ^^{9 ,d ,Y) — l'^{6). Next, fix (z,j) G A'^, and consider 
the expansion of .^•(6'-^'^", a, Y) around 9: 

L;,^.(^-^'^", a, Y) = L'^^,^{e, a, Y) + L^.^^A^, Y)z/„. (S-8) 
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Then plug in dSH]) into dMj), we get 



+ L'^,^{e,a,Y) + Y) -i;.^(^")[i;;^(^")]-iZ}„,^^(^",a, Y)] (S-9) 

By condition C2, for any E A^: |l'4_4(^)[I^_4(^)]-isign(^^)| < 5 < 1. Thus it 
suffices to prove that the remaining terms in (18-9^ are all o(A„) with probability at 
least 1 — 0(n~^') (uniformly for all (i, j) G A"^). Then since l^'^l < p ~ 0{n'^), by the 
union bound, the event max(j ,,)g_4c jj(6'-^'^", a, Y)| < A„ holds with probability at 
least 1 — 0{n'^~''^') = 1 — 0{n~''^), when n is sufficiently large. 

By B1.4, for any (z, j) G A^: \\L'!j,Ai(^)[L'^{e)]-% < M{e). Therefore by Lemma 
\S-2\ for any r] > 0, there exists a constant Ci^r^ > 0, such that 



max \Lljm'AAm~'L'^,A(^^^^^)\ < CiMr-^^) = KAn)) 
{i,j)eA'^ \ n 



with probability at least l—0{n ^). The claim follows by the assumption y " ~ 
o(A„). 

By B1.2, ||Vare-(L^(^,a, Y))||2 < Mi{9). Then similarly as in Lemma [S3 for 
any 77 > 0, there exists a constant C2,r] > 0, such that maxij \L'^^j{9,d,Y)\ < 



C2,Ti{\/^^^) = (c(A„)), with probability at least 1 — 0{n The claims follows 



by the assumption that Any'j^^ 00. 

Note that by Theorem 1, for any ?7 > 0, Hz^nlb < C{6)y/q^Xn with probability at 
least 1 — 0{n^^) for large enough n. Thus, similarly as in Lemma [8-21 for any 77 > 0, 



there exists a constant Cg^^, such \Dn,ij,A{(^, a, Y)z/„| < C-^^riiy '^"^°^" A/9^-^n)(= o(A„)), 
with probability at least 1 — 0(n~''). The claims follows from the assumption 
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Finally, let 6^ = \L-j ^{6)[Lj^{6)] ^. By Cauchy-Schwartz inequality 
\b'^Dn,AA{^,^,^)^n\ < \\b^Dn,AA{0,a,Y)\\2\\iyn\\2 < QuK max |6^£)„,^,fc/(^, o", Y)|. 

{k,l)£A 

In order to show the right hand side is o(A„) with probability at least 1 — 0{n~^), it 
suffices to show max(fc^/)g_4 |6-^D„^_4^fc/(i9, a, Y)| = 0{\J'^^) with probability at least 

1 — 0(n~''), because of the the assumption g„ ~ ^(y^Egn)- This is implied by 

E-e{\lFDj,^ui{e,aX)?) < 1 16| |^A,,,,(Var,-(Z}^,fc,(^, ^, 

being bounded, which follows immediately from B1.4 and Lemma [S-5[ Finally, simi- 
larly as in Lemma [S-21 

\b^Dn,AA{0, a, Y)iy^\ < |6'^D„,A4(^, (r, Y)i.„| + |6'^(D„,^(^, a, Y)-A.,A4(^, Y))i.„|, 

where by B3 , the second term on the right hand side is bounded by Op ( \J^^^) 1 1 ^ 1 1 2 1 1 1 1 2 • 
Note that 1 16| I2 ~ thus the second term is also of order o(A„) by the assumption 
g„ ~ j^)- This completes the proof. 

proof of Theorem 3: By Theorems 1 and 2 and the Karush-Kuhn- Tucker condition, 
for any 77 > 0, with probability at least 1 — 0{n~^), a solution of the restricted 
problem is also a solution of the original problem. On the other hand, by Theorem 

2 and the Karush-Kuhn- Tucker condition, with high probability, any solution of the 
original problem is a solution of the restricted problem. Therefore, by Theorem 1, 
the conclusion follows. 
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Part III 



In this section, we provide details for the implementation of space which takes ad- 
vantage of the sparse structure of X. Denote the target loss function as 



1 n ,,9 

- y-xe f 

2 " " 



(S-IO) 



i<j 



Our goal is to find 9 = argming/(^) for a given Ai. We will employ active-shooting 
algorithm (Section 2.3) to solve this optimization problem. 

Without loss of generality, we assume mean(Yj) = ^/nY^^^iUi = for z = 
1, . . . ,p. Denote C,i = YfYj. We have 



■3J 



Denote p*-' = P(ij)- We now present details of the initialization step and the updating 
steps in the active- shooting algorithm. 

1. Initialization 

Let 



(0) _ (|y^^(...)|-Ai)^-sign(y^A'(,,,.)) 



For j = 1, . . . ,p, compute 



Ai .sign{YfYj) 



^3 ^11 "rs2 _jj 



(S-11) 



KO) 



aPP, 
an 



(0) 



(S-12) 
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and 



i?(°) = 3^-5?(°) = ((i^rf,...,(4°)f) 



where Ef = Yj - Y}"^ for 1 < j < p. 



(0) 



2. Update p|S) — pS) 

Let 



(0)nT 



Aj.) = (El 



{0)\T 



We have 



iO)\T 



an ^ i 



It follows 



(1) . , (0) 



Sign 



5j ^ir +5» 



(^'°')^^(..) , (0) 



3. Update p 



P 



(t+i) 



From the previous iteration, we have 



^(t 1). residual in the previous iteration {np x 1 vector). 



(zq, jo)- index of coefficient that is updated in the previous iteration. 



{t-D 



if ihj) (^0, jo), nor (jo,io) 



if (^'i) = (^0, jo), or (jo,io) 
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Then, 



E 



(t) 



E^^ for k^iojo] 



E. 



it) 

jo 



E. 



(i-l) 
(i-l) 



" 30 



Y 



E. 



(t) 

io 



E^o 

30 



Z^i=l Y ^^\y(i,jo) f^{i,3o)^ 



+ 



fT'0'0 



Y,-„ ■ A; 



E, 



«0 



(jJOJO 



Y,„ ■ A. 



(S-18) 



7»0*0 JO 

Suppose the index of the coefficient we would hke to update in this iteration is 
then let 



{h,ji) 



{E^ 



a 



Jill 



11 1 



{t)\T 



A3.M) = (^D 



Cr3i3i 



a 



■111 ■'^ 



We have 



Sign 



^(ji.n)+^('i Ji) 



j-JUl 

n)+^('iJi) 



(S-19) 



Using the above steps 1-3, we have implemented the active-shooting algorithm 
in c, and the corresponding R package space to fit the space model is available on 
cran. 
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